CSHL career development series

Data visualisation with R

Hannah Meyer

January 28, 2020

Background: Antigenic cartography

adapted from (Pattinson 2019)

Background: Antigenic cartography

adapted from (Pattinson 2019)

Background: Antigenic cartography

adapted from (Pattinson 2019), (Smith et al. 2004)

Getting started with Rstudio

Starting an Rproject for your analysis

Starting of with a clean slate

Loading libraries

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("sf")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library("rnaturalearth")
library("RColorBrewer")

A primer to R functions

function_output <- function(argument1=something, argument2=something_else)
function_output what the function returns
<- the assignment operator
function name of the function
argument1 first argument the function accepts
something our specification to argument1
argument2 second argument the function accepts
something_else our specification to argument2

A primer to R functions

function_output <- function(argument1=something, argument2=something_else)
function_output what the function returns
<- the assignment operator
function name of the function
argument1 first argument the function accepts
something our specification to argument1
argument2 second argument the function accepts
something_else our specification to argument2
coord <- read_csv(file = "data/2004_Science_Smith_data.csv")
function_output coord
<- the assignment operator
function read_csv
argument1 file
something “data/2004_Science_Smith_data.csv”

Reading your data

Data from Smith et al (2004): http://www.antigenic-cartography.org/)

coord <- read_csv("data/2004_Science_Smith_data.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   year = col_double(),
##   cluster = col_character(),
##   type = col_character(),
##   x.coordinate = col_double(),
##   y.coordinate = col_double(),
##   location = col_character(),
##   lat = col_double(),
##   lng = col_double()
## )

Exercises

  1. Use the help function? to have a look at the documentation of read_csv.
  2. Have a look at the coord object by creating a new chunk, typing coord and executing the code chunk.
  3. Do you find the message printed by read_csv represented in coord?

Exercises

  1. Use the help function? to have a look at the documentation of read_csv.
`?`(read_csv)
read_delim {readr}  R Documentation
Read a delimited file (including csv & tsv) into a tibble

Description
read_csv() and read_tsv() are special cases of the general read_delim().
They're useful for reading the most common types of flat file data, comma
separated values and tab separated values, respectively [...]

Exercises

  1. Have a look at the coord object by creating a new chunk, typing coord and executing the code chunk.
coord
## # A tibble: 322 x 9
##    name   year cluster type  x.coordinate y.coordinate location   lat   lng
##    <chr> <dbl> <chr>   <chr>        <dbl>        <dbl> <chr>    <dbl> <dbl>
##  1 BI/1…  1968 HK68    AG            4.05         15.0 BILTHOV…  52.1  5.02
##  2 BI/1…  1968 HK68    AG            4.10         14.8 BILTHOV…  52.1  5.02
##  3 BI/1…  1968 HK68    AG            4.36         13.9 BILTHOV…  52.1  5.02
##  4 BI/8…  1969 HK68    AG            3.87         14.3 BILTHOV…  52.1  5.02
##  5 BI/9…  1969 HK68    AG            4.87         14.1 BILTHOV…  52.1  5.02
##  6 BI/1…  1969 HK68    AG            4.40         14.9 BILTHOV…  52.1  5.02
##  7 BI/9…  1970 HK68    AG            5.06         14.5 BILTHOV…  52.1  5.02
##  8 BI/2…  1970 HK68    AG            4.82         15.5 BILTHOV…  52.1  5.02
##  9 BI/6…  1971 HK68    AG            3.87         15.9 BILTHOV…  52.1  5.02
## 10 BI/2…  1971 HK68    AG            4.27         14.1 BILTHOV…  52.1  5.02
## # … with 312 more rows

Exercises

  1. Do you find the message printed by read_csv represented in coord?

Compare the message printed by read_csv:

Parsed with column specification:
cols(
  name = col_character(),
  year = col_double(),
  cluster = col_character(),
  type = col_character(),
  x.coordinate = col_double(),
  y.coordinate = col_double(),
  location = col_character(),
  lat = col_double(),
  lng = col_double()
)

to the column specification in coord

## # A tibble: 6 x 9
##   name    year cluster type  x.coordinate y.coordinate location   lat   lng
##   <chr>  <dbl> <chr>   <chr>        <dbl>        <dbl> <chr>    <dbl> <dbl>
## 1 BI/15…  1968 HK68    AG            4.05         15.0 BILTHOV…  52.1  5.02
## 2 BI/16…  1968 HK68    AG            4.10         14.8 BILTHOV…  52.1  5.02
## 3 BI/16…  1968 HK68    AG            4.36         13.9 BILTHOV…  52.1  5.02
## 4 BI/80…  1969 HK68    AG            3.87         14.3 BILTHOV…  52.1  5.02
## 5 BI/90…  1969 HK68    AG            4.87         14.1 BILTHOV…  52.1  5.02
## 6 BI/17…  1969 HK68    AG            4.40         14.9 BILTHOV…  52.1  5.02

Data types

The most common data types in R (base R and tidyverse) are:

int integers 1, 2, 3
dbl doubles 1.2, 1.7, 9.0
chr character “a”, “b”, “word”
lgl logical TRUE or FALSE
fctr factors categorical variables with fixed values

Object types

  1. data.frame
##          name year cluster type x.coordinate y.coordinate  location   lat
## 1 BI/15793/68 1968    HK68   AG     4.048064     14.97272 BILTHOVEN 52.14
## 2 BI/16190/68 1968    HK68   AG     4.103302     14.80633 BILTHOVEN 52.14
## 3 BI/16398/68 1968    HK68   AG     4.363448     13.89293 BILTHOVEN 52.14
## 4   BI/808/69 1969    HK68   AG     3.871698     14.25529 BILTHOVEN 52.14
## 5   BI/908/69 1969    HK68   AG     4.868656     14.09319 BILTHOVEN 52.14
## 6 BI/17938/69 1969    HK68   AG     4.400375     14.86012 BILTHOVEN 52.14
##    lng
## 1 5.02
## 2 5.02
## 3 5.02
## 4 5.02
## 5 5.02
## 6 5.02
  1. tibble
## # A tibble: 6 x 9
##   name    year cluster type  x.coordinate y.coordinate location   lat   lng
##   <chr>  <dbl> <chr>   <chr>        <dbl>        <dbl> <chr>    <dbl> <dbl>
## 1 BI/15…  1968 HK68    AG            4.05         15.0 BILTHOV…  52.1  5.02
## 2 BI/16…  1968 HK68    AG            4.10         14.8 BILTHOV…  52.1  5.02
## 3 BI/16…  1968 HK68    AG            4.36         13.9 BILTHOV…  52.1  5.02
## 4 BI/80…  1969 HK68    AG            3.87         14.3 BILTHOV…  52.1  5.02
## 5 BI/90…  1969 HK68    AG            4.87         14.1 BILTHOV…  52.1  5.02
## 6 BI/17…  1969 HK68    AG            4.40         14.9 BILTHOV…  52.1  5.02

Object types

  1. data.frame
  2. tibble
Variables

Variables

Observations

Observations

Exercises:

  1. Which data types are present in our dataset?
  2. How many observations and variables are in our dataset?
  3. What are the variables?

Exercises:

  1. Which data types are present in our dataset? Have a look at the third row of the ouput of:
coord
## # A tibble: 322 x 9
##    name   year cluster type  x.coordinate y.coordinate location   lat   lng
##    <chr> <dbl> <chr>   <chr>        <dbl>        <dbl> <chr>    <dbl> <dbl>
##  1 BI/1…  1968 HK68    AG            4.05         15.0 BILTHOV…  52.1  5.02
##  2 BI/1…  1968 HK68    AG            4.10         14.8 BILTHOV…  52.1  5.02
##  3 BI/1…  1968 HK68    AG            4.36         13.9 BILTHOV…  52.1  5.02
##  4 BI/8…  1969 HK68    AG            3.87         14.3 BILTHOV…  52.1  5.02
##  5 BI/9…  1969 HK68    AG            4.87         14.1 BILTHOV…  52.1  5.02
##  6 BI/1…  1969 HK68    AG            4.40         14.9 BILTHOV…  52.1  5.02
##  7 BI/9…  1970 HK68    AG            5.06         14.5 BILTHOV…  52.1  5.02
##  8 BI/2…  1970 HK68    AG            4.82         15.5 BILTHOV…  52.1  5.02
##  9 BI/6…  1971 HK68    AG            3.87         15.9 BILTHOV…  52.1  5.02
## 10 BI/2…  1971 HK68    AG            4.27         14.1 BILTHOV…  52.1  5.02
## # … with 312 more rows

Exercises:

  1. How many observations and variables are in our dataset? Have a look at the first row of the ouput of:
coord
## # A tibble: 322 x 9
##    name   year cluster type  x.coordinate y.coordinate location   lat   lng
##    <chr> <dbl> <chr>   <chr>        <dbl>        <dbl> <chr>    <dbl> <dbl>
##  1 BI/1…  1968 HK68    AG            4.05         15.0 BILTHOV…  52.1  5.02
##  2 BI/1…  1968 HK68    AG            4.10         14.8 BILTHOV…  52.1  5.02
##  3 BI/1…  1968 HK68    AG            4.36         13.9 BILTHOV…  52.1  5.02
##  4 BI/8…  1969 HK68    AG            3.87         14.3 BILTHOV…  52.1  5.02
##  5 BI/9…  1969 HK68    AG            4.87         14.1 BILTHOV…  52.1  5.02
##  6 BI/1…  1969 HK68    AG            4.40         14.9 BILTHOV…  52.1  5.02
##  7 BI/9…  1970 HK68    AG            5.06         14.5 BILTHOV…  52.1  5.02
##  8 BI/2…  1970 HK68    AG            4.82         15.5 BILTHOV…  52.1  5.02
##  9 BI/6…  1971 HK68    AG            3.87         15.9 BILTHOV…  52.1  5.02
## 10 BI/2…  1971 HK68    AG            4.27         14.1 BILTHOV…  52.1  5.02
## # … with 312 more rows

Exercises:

  1. What are the variables?
name name of virus isolate
year year of isolation
cluster derived cluster
type serum or antigen measurement
x.coordinate x coordinate in antigenic space
y.coordinate y coordinate in antigenic space
location location of virus measurement
lat latitude of location
lng longitude of location

A recipe for generating graphs with ggplot2

p <- ggplot(data = coord)

A recipe for generating graphs with ggplot2

p <- ggplot(data = coord)

p + geom_point(mapping = aes(x = x.coordinate, y = y.coordinate))

Our first plot

p <- ggplot(data = coord)
p + geom_point(mapping = aes(x = x.coordinate, y = y.coordinate))

Exercises

  1. Run ggplot(data = coord). What do you see?

Exercises

  1. Run ggplot(data = coord). What do you see?
ggplot(data = coord)

Exercises

  1. What makes this simple plot look very different from the map that we want to achieve?

  2. What other information in our data object coord could we use?

Exercises

  1. What makes this simple plot look very different from the map that we want to achieve?
    • no color information
    • different background
    • different dimensions
    • different labels
  2. What other information in our data object coord could we use?
    • cluster
    • year
    • location

Aesthetics

[adapted from Wilke (2019) Fundamentals of Data Visualization]

Mapping additional aesthetics

p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster))

Colorbrewer color scales

display.brewer.all()

Color scales:

p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster)) + 
    scale_color_brewer(type = "qual", palette = "Set3")

Shape scales

Manual aesthetics

p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster), 
    shape = 17) + scale_color_brewer(type = "qual", palette = "Set3")

Exercises

  1. Try changing the cluster aesthetic to size and shape. Does this convey the same level of information as a color scale?
  2. What other variable in our dataset would be well represented by a shape scale? Add a shape aesthetic for the variable you identified.
  3. Generally speaking, which type of data lends itself to shape scales, which to size, which to color?

Exercises

  1. Try changing the cluster aesthetic to size and shape. Does this convey the same level of information as a color scale?
p + geom_point(aes(x = x.coordinate, y = y.coordinate, size = cluster))
## Warning: Using size for a discrete variable is not advised.

Exercises

  1. Try changing the cluster aesthetic to size and shape. Does this convey the same level of information as a color scale?
p + geom_point(aes(x = x.coordinate, y = y.coordinate, shape = cluster))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 11. Consider specifying shapes manually if you must have them.
## Warning: Removed 111 rows containing missing values (geom_point).

Exercises

  1. What other variable in our dataset would be well represented by a shape scale? Add a shape aesthetic for the variable you identified.
p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster, 
    shape = type))

Exercises

  1. Generally speaking, which type of data lends itself to shape scales, which to size, which to color?
    • shape: distinct classes, moderate (up to 6) number of classes
    • size:
      • continous data that is either sequential or diverging
    • color:
      • distinct classes, moderate to high (up to 12) number of classes
      • continous data that is either sequential or diverging

Exercises

  1. Why does this not work?
p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster)) + 
    scale_color_brewer(type = "qual", palette = "Set1")
  1. Why does this code not color all points in blue?
p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = "blue"))
  1. Advanced: Change the overall shape of points to number 24 triangles, then color by cluster and fill by type.

Exercises

  1. Why does this not work?
p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster)) + 
    scale_color_brewer(type = "qual", palette = "Set1")

-> Not enough classes in palette

Exercises

  1. Why does this code not color all points in blue?
p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = "blue"))

-> color is inside the aesthetics mapping; if manually setting colors, move outside of aes

p + geom_point(aes(x = x.coordinate, y = y.coordinate), color = "blue")

Exercises

  1. Advanced: Change the overall shape of points to number 24 triangles, then color by cluster and fill by type.
p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster, 
    fill = type), shape = 24)

Coordinate system

p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster)) + 
    scale_color_brewer(type = "qual", palette = "Set3") + coord_fixed()

Labels

p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster)) + 
    scale_color_brewer(type = "qual", palette = "Set3") + labs(x = "Dimension 1 [AU]", 
    y = "Dimension 2 [AU]", title = "Antigenic cartography", 
    color = "Cluster") + coord_fixed()

Themes

p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster)) + 
    scale_color_brewer(type = "qual", palette = "Set3") + labs(x = "Dimension 1 [AU]", 
    y = "Dimension 2 [AU]", title = "Antigenic cartography", 
    color = "Cluster") + coord_fixed() + theme_bw()

Exercises

  1. What other coordinate system options exist? Hint: type ?coord_ in a new chunk and press tab to see other options.
  2. In the previous set of exercises, you added a shape aesthetic. Rename the legend title for this aesthetic
  3. Test different themes and see how it effects the plot, for instance use theme_void(), theme_dark() and theme_classic(). Similar to Exercise 1, you can type ?theme_ and tab to see other possible build in themes.
  4. Why does this not work?
p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster)) + 
    scale_color_brewer(type = "qual", palette = "Set3") + labs(x = "Dimension 1 [AU]", 
    y = "Dimension 2 [AU]", title = "Antigenic cartography", 
    color = "Cluster") + coord_fixed()
+theme_light()

Exercises

  1. What other coordinate system options exist? Hint: type ?coord_ in a new chunk and press tab to see other options.

Exercises

  1. In the previous set of exercises, you added a shape aesthetic. Add this aesthetic here and rename its legend title.
p + geom_point(aes(x = x.coordinate, y = y.coordinate, shape = type, 
    color = cluster)) + scale_color_brewer(type = "qual", palette = "Set3") + 
    labs(x = "Dimension 1 [AU]", y = "Dimension 2 [AU]", title = "Antigenic cartography", 
        color = "Cluster", shape = "Measurement") + coord_fixed() + 
    theme_bw()

Exercise

  1. Test different themes and see how it effects the plot, for instance use theme_void(), theme_dark() and theme_classic(). Similar to Exercise 1, you can type ?theme_ and tab to see other possible build in themes.

Exercise

p + geom_point(aes(x = x.coordinate, y = y.coordinate, shape = type, 
    color = cluster)) + scale_color_brewer(type = "qual", palette = "Set3") + 
    labs(x = "Dimension 1 [AU]", y = "Dimension 2 [AU]", title = "Antigenic cartography", 
        color = "Cluster", shape = "Measurement") + coord_fixed() + 
    theme_void()

Exercise

p + geom_point(aes(x = x.coordinate, y = y.coordinate, shape = type, 
    color = cluster)) + scale_color_brewer(type = "qual", palette = "Set3") + 
    labs(x = "Dimension 1 [AU]", y = "Dimension 2 [AU]", title = "Antigenic cartography", 
        color = "Cluster", shape = "Measurement") + coord_fixed() + 
    theme_dark()

Exercise

p + geom_point(aes(x = x.coordinate, y = y.coordinate, shape = type, 
    color = cluster)) + scale_color_brewer(type = "qual", palette = "Set3") + 
    labs(x = "Dimension 1 [AU]", y = "Dimension 2 [AU]", title = "Antigenic cartography", 
        color = "Cluster", shape = "Measurement") + coord_fixed() + 
    theme_classic()

Exercise

  1. Why does this not work?
p + geom_point(aes(x = x.coordinate, y = y.coordinate, color = cluster)) + 
    scale_color_brewer(type = "qual", palette = "Set3") + labs(x = "Dimension 1 [AU]", 
    y = "Dimension 2 [AU]", title = "Antigenic cartography", 
    color = "Cluster") + coord_fixed()
+theme_light()

As indicated in error message:

Error: Cannot use `+.gg()` with a single argument. Did you accidentally put
+ on a new line?

Saving your plots

ggsave(filename = "results/antigenic_cartography.pdf")
## Saving 8 x 6 in image
ggsave(filename = "results/antigenic_cartography.pdf", width = 10, 
    height = 16, unit = "cm")

Note: ggsave overwrites the previous file of that name without warning!

Exercises

  1. Save the plot as png and jpeg.
  2. Change the size of the plot with width and height; what happens to figure labels and legends?

Exercises

  1. Save the plot as png and jpeg.
ggsave(filename = "results/antigenic_cartography.png", width = 10, 
    height = 16, unit = "cm")

ggsave(filename = "results/antigenic_cartography.jpeg", width = 10, 
    height = 16, unit = "cm")

Exercises

  1. Change the size of the plot with width and height; what happens to figure labels and legends?

-> labels and legend stay legible, make sure to always choose the right text sizes and image sizes in combination

Bar charts

p + geom_bar(aes(x = year, fill = cluster), position = position_dodge(preserve = "single")) + 
    scale_fill_brewer(type = "qual", palette = "Set3") + theme_bw()

Histograms

p + geom_histogram(aes(x = year, fill = cluster), position = position_dodge(preserve = "single"), 
    binwidth = 10) + scale_fill_brewer(type = "qual", palette = "Set3") + 
    theme_bw()

Boxplots

p + geom_boxplot(aes(x = type, y = year, color = type)) + geom_jitter(aes(x = type, 
    y = year, color = type)) + theme_bw()

Geographical maps

world <- ne_countries(scale = "medium", returnclass = "sf")

g <- ggplot()
g + geom_sf(data = world) + geom_point(data = coord, aes(x = lng, 
    y = lat, color = cluster)) + scale_color_brewer(type = "qual", 
    palette = "Set3")

Exercises

  1. Test different options for the position argument of geom_bar. Hint: use the Details paragraph in ?geom_bar to find a description about possible options.
  2. What happens when you choose preserve="total" in position_dodge of geom_histogram?
  3. Customise the color scale and plot labels in the boxplot showing the distribution of measurements per type and year. What happens if you choose aes(fill) instead of aes(color)?
  4. Change geom_jitter to geom_point to see why geom_jitter is a better visualisation of the data. Go back to using geom_jitter and play with the width argument to customise your plot.
  5. What would be a good theme for the world map? Add it to the plot.

Exercises

  1. Test different options for the position argument of geom_bar. Hint: use the Details paragraph in ?geom_bar to find a description about possible options.

    Details

    By default, multiple bars occupying the same x position will be stacked atop one another by position_stack(). If you want them to be dodged side-to-side, use position_dodge() or position_dodge2(). Finally, position_fill() shows relative proportions at each x by stacking the bars and then standardising each bar to have the same height.

p + geom_bar(aes(x = year, fill = cluster), position = position_stack()) + 
    scale_fill_brewer(type = "qual", palette = "Set3") + theme_bw()

Exercises

  1. What happens when you choose preserve="total" in position_dodge of geom_histogram?

Hint: In the help function for geom_histogram, click on position_dodge to get to the help for this function. From there, you can see:

preserve    Should dodging preserve the total width of all elements at a
            position, or the width of a single element?
            
p + geom_histogram(aes(x = year, fill = cluster), position = position_dodge(preserve = "total"), 
    binwidth = 10) + scale_fill_brewer(type = "qual", palette = "Set3") + 
    theme_bw()

# Exercises

  1. Customise the color scale and plot labels in the boxplot showing the distribution of measurements per type and year. What happens if you choose aes(fill) instead of aes(color)?
p + geom_boxplot(aes(x = type, y = year, fill = type)) + geom_jitter(aes(x = type, 
    y = year, color = type)) + scale_fill_manual(values = c("#66c2a5", 
    "#fc8d62")) + labs(x = "Measurement", y = "Time", color = "Measurement") + 
    theme_bw()

Exercises

  1. Change geom_jitter to geom_point to see why geom_jitter is a better visualisation of the data. Go back to using geom_jitter and play with the width argument to customise your plot.
p + geom_boxplot(aes(x = type, y = year, color = type)) + geom_point(aes(x = type, 
    y = year, color = type)) + theme_bw()

Exercises

  1. What would be a good theme for the world map? Add it to the plot.
world <- ne_countries(scale = "medium", returnclass = "sf")

g <- ggplot()
g + geom_sf(data = world) + geom_point(data = coord, aes(x = lng, 
    y = lat, color = cluster)) + scale_color_brewer(type = "qual", 
    palette = "Set3") + theme_void()

Additional material

References

Smith, Derek J., Alan S. Lapedes, Jan C. de Jong, Theo M. Bestebroer, Guus F. Rimmelzwaan, Albert D. M. E. Osterhaus, and Ron A. M. Fouchier. 2004. “Mapping the Antigenic and Genetic Evolution of Influenza Virus.” Science 305 (5682). American Association for the Advancement of Science: 371–76. https://doi.org/10.1126/science.1097211.

Wilke, Claus O. 2019. Fundamentals of Data Visualization. 1st ed. O’Reilly Media, Inc. https://serialmentor.com/dataviz/.